anuary 31
adabmDCA 2.0 -- a flexible but easy-to-use package for Direct Coupling Analysis
Rosset, Lorenzo, Netti, Roberto, Muntoni, Anna Paola, Weigt, Martin, Zamponi, Francesco
In this methods article, we provide a flexible but easy-to-use implementation of Direct Coupling Analysis (DCA) based on Boltzmann machine learning, together with a tutorial on how to use it. The package \texttt{adabmDCA 2.0} is available in different programming languages (C++, Julia, Python) usable on different architectures (single-core and multi-core CPU, GPU) using a common front-end interface. In addition to several learning protocols for dense and sparse generative DCA models, it allows to directly address common downstream tasks like residue-residue contact prediction, mutational-effect prediction, scoring of sequence libraries and generation of artificial sequences for sequence design. It is readily applicable to protein and RNA sequence data.
An Information Retrieval and Extraction Tool for Covid-19 Related Papers
Background: The COVID-19 pandemic has caused severe impacts on health systems worldwide. Its critical nature and the increased interest of individuals and organizations to develop countermeasures to the problem has led to a surge of new studies in scientific journals. Objetive: We sought to develop a tool that incorporates, in a novel way, aspects of Information Retrieval (IR) and Extraction (IE) applied to the COVID-19 Open Research Dataset (CORD-19). The main focus of this paper is to provide researchers with a better search tool for COVID-19 related papers, helping them find reference papers and hightlight relevant entities in text. Method: We applied Latent Dirichlet Allocation (LDA) to model, based on research aspects, the topics of all English abstracts in CORD-19. Relevant named entities of each abstract were extracted and linked to the corresponding UMLS concept. Regular expressions and the K-Nearest Neighbors algorithm were used to rank relevant papers. Results: Our tool has shown the potential to assist researchers by automating a topic-based search of CORD-19 papers. Nonetheless, we identified that more fine-tuned topic modeling parameters and increased accuracy of the research aspect classifier model could lead to a more accurate and reliable tool. Conclusion: We emphasize the need of new automated tools to help researchers find relevant COVID-19 documents, in addition to automatically extracting useful information contained in them. Our work suggests that combining different algorithms and models could lead to new ways of browsing COVID-19 paper data.
Approximating DTW with a convolutional neural network on EEG data
Lerogeron, Hugo, Picot-Clemente, Romain, Rakotomamonjy, Alain, Heutte, Laurent
Dynamic Time Wrapping (DTW) is a widely used algorithm for measuring similarities between two time series. It is especially valuable in a wide variety of applications, such as clustering, anomaly detection, classification, or video segmentation, where the time-series have different timescales, are irregularly sampled, or are shifted. However, it is not prone to be considered as a loss function in an end-to-end learning framework because of its non-differentiability and its quadratic temporal complexity. While differentiable variants of DTW have been introduced by the community, they still present some drawbacks: computing the distance is still expensive and this similarity tends to blur some differences in the time-series. In this paper, we propose a fast and differentiable approximation of DTW by comparing two architectures: the first one for learning an embedding in which the Euclidean distance mimics the DTW, and the second one for directly predicting the DTW output using regression. We build the former by training a siamese neural network to regress the DTW value between two time-series. Depending on the nature of the activation function, this approximation naturally supports differentiation, and it is efficient to compute. We show, in a time-series retrieval context on EEG datasets, that our methods achieve at least the same level of accuracy as other DTW main approximations with higher computational efficiency. We also show that it can be used to learn in an end-to-end setting on long time series by proposing generative models of EEGs.
Transfer learning for chemically accurate interatomic neural network potentials
Zaverkin, Viktor, Holzmüller, David, Bonfirraro, Luca, Kästner, Johannes
Developing machine learning-based interatomic potentials from ab-initio electronic structure methods remains a challenging task for computational chemistry and materials science. This work studies the capability of transfer learning, in particular discriminative fine-tuning, for efficiently generating chemically accurate interatomic neural network potentials on organic molecules from the MD17 and ANI data sets. We show that pre-training the network parameters on data obtained from density functional calculations considerably improves the sample efficiency of models trained on more accurate ab-initio data. Additionally, we show that fine-tuning with energy labels alone can suffice to obtain accurate atomic forces and run large-scale atomistic simulations, provided a well-designed fine-tuning data set. We also investigate possible limitations of transfer learning, especially regarding the design and size of the pre-training and fine-tuning data sets. Finally, we provide GM-NN potentials pre-trained and fine-tuned on the ANI-1x and ANI-1ccx data sets, which can easily be fine-tuned on and applied to organic molecules.
A tutorial on ensembles and deep learning fusion with MNIST as guiding thread: A complex heterogeneous fusion scheme reaching 10 digits error
Tabik, S., Alvear-Sandoval, R. F., Ruiz, M. M., Sancho-Gómez, J. L., Figueiras-Vidal, A. R., Herrera, F.
Ensemble methods have been widely used for improving the results of the best single classification model. Indeed, a large body of works have achieved better results mainly by applying one specific ensemble method. However, very few works analyze complex fusion schemes using heterogeneous ensemble strategies. This paper is three-fold: 1) It provides a tutorial of the most popular ensemble methods, 2) analyzes the best ensembles using MNIST as guiding thread and 3) shows that complex fusion architectures based on heterogeneous ensembles can be considered as a mode of taking benefit from diversity. We introduce a complex fusion design that achieves a new record in MNIST with only 10 misclassified images.